Add negative lookahead by ehuss · Pull Request #2172 · rust-lang/reference

ehuss · 2026-02-13T18:41:05Z

This adds the ! negative lookahead to the grammar to make it easier to express certain rules, and to remove some of the English-based rules.

This updates several rules to use !, and also fixes mistakes in several rules. See the individual commits for more details.

As part of this, it also adds the ability to specify U+xxxx Unicode values in character ranges, since it was needed to express some things without English rules.

traviscross

Looks good. Pushing some fixes and tweaks.

dev-guide/src/grammar.md

src/tokens.md

ehuss · 2026-02-18T01:54:26Z

dev-guide/src/grammar.md

    | NegativeExpression

-Unicode -> `U+` [`A`-`Z` `0`-`9`]4..4
+Unicode -> `U+` [`A`-`Z` `0`-`9`]4..6


Now that the new range syntax is in.

Suggested change

-Unicode -> `U+` [`A`-`Z` `0`-`9`]4..6
+Unicode -> `U+` [`A`-`Z` `0`-`9`]4..=6

This adds the `!` prefix, which represents negative lookahead. This was included in the original PEG paper, though it was called "NOT", whereas I went with a more explicit "NegativeLookahead". This will be helpful in several productions which need to have these kinds of exclusions. The `!` prefix is common in many other PEG libraries, and regular expression engines offer the same feature, usually spelled `(?!expr)`. There is a small risk this could be confusing, since `!` is sometimes used for other purposes in other contexts. For example, Prolog uses `!` for its cut operator. I think this should be fine since it is common with PEG.
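To make the semantics concrete, here is a minimal Rust sketch of negative lookahead written as a parser combinator. It is not taken from the reference's grammar tooling: the helper names `literal`, `not`, and `two_slashes_only`, and the rule `` `//` !`/` `` itself, are hypothetical and chosen only for illustration.

```rust
/// Matches the literal `lit` at the start of `input`.
/// Returns the number of bytes consumed on success.
fn literal(input: &str, lit: &str) -> Option<usize> {
    if input.starts_with(lit) {
        Some(lit.len())
    } else {
        None
    }
}

/// Negative lookahead (`!e`): succeeds only if `inner` fails at this
/// position, and never consumes any input.
fn not(input: &str, inner: impl Fn(&str) -> Option<usize>) -> Option<usize> {
    match inner(input) {
        Some(_) => None,
        None => Some(0),
    }
}

/// Hypothetical rule for illustration: `//` !`/`
/// (two slashes that are not followed by a third one).
fn two_slashes_only(input: &str) -> Option<usize> {
    let consumed = literal(input, "//")?;
    // The lookahead inspects the rest of the input but adds nothing
    // to the number of bytes consumed.
    not(&input[consumed..], |rest| literal(rest, "/"))?;
    Some(consumed)
}
```

With these definitions, `two_slashes_only("// x")` consumes two bytes, while `two_slashes_only("/// x")` fails without consuming anything, which is the kind of exclusion that previously needed prose.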

This adds the ability to specify Unicode code points in a character range. This will be useful for defining some productions without using English, and perhaps to be a little clearer. This also extends the Unicode grammar to allow up to 6 characters for larger code points.
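As a rough illustration of the widened notation, the following Rust sketch accepts `U+` followed by 4 to 6 characters from the `A`-`Z`, `0`-`9` set and converts the result to a `char`. The function name `parse_code_point` is hypothetical, and interpreting the digits as hexadecimal and rejecting values that are not Unicode scalar values are assumptions of this sketch, not something the grammar rule itself states.

```rust
/// Parses the `U+xxxx` notation: `U+` followed by 4 to 6 characters
/// drawn from `A`-`Z` and `0`-`9`, interpreted here as hexadecimal.
///
/// Returns `None` if the text has the wrong shape, contains a letter
/// that is not a hex digit, or does not name a Unicode scalar value.
fn parse_code_point(text: &str) -> Option<char> {
    let digits = text.strip_prefix("U+")?;

    // The rule allows between 4 and 6 characters, inclusive.
    let width_ok = (4..=6).contains(&digits.len());
    // Only uppercase ASCII letters and ASCII digits are in the charset.
    let charset_ok = digits
        .chars()
        .all(|c| c.is_ascii_uppercase() || c.is_ascii_digit());
    if !(width_ok && charset_ok) {
        return None;
    }

    // Letters G-Z pass the charset check but are rejected here.
    let value = u32::from_str_radix(digits, 16).ok()?;
    // Rejects surrogates and values above U+10FFFF.
    char::from_u32(value)
}
```

Five and six characters are what make code points above U+FFFF expressible, for example `U+1F600` or `U+10FFFF`.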

This replaces some suffixes and prose with the new negative lookahead syntax instead. This should all have the same meaning.

This clarifies that bare `//` is explicitly meant to be either followed by LF or EOF. Otherwise it incorrectly matches other comment rules.

This fixes the BLOCK_COMMENT grammar so that it follows the rule that the first alternation that matches wins. The previous grammar would fail with the use of the cut operator to parse these two forms.

This fixes the doc comments so that they properly handle a carriage return by using the cut operator. Rustc will fail parsing if a doc comment contains a carriage return. This requires including (LF|EOF) at the end of line so the cut operator has something to complete the line. This also removes the negative `/` from OUTER_LINE_DOC. This does not work correctly with the check for CR, and is not needed because LINE_COMMENT already matches `////`. Later I plan to include a rule for comments that makes it clear the order that they are parsed. A negative lookahead is necessary in OUTER_BLOCK_DOC to prevent it from trying to parse what should be a BLOCK_COMMENT as an OUTER_BLOCK_DOC and failing due to the cut operator.

This is intended to indicate the order that the rules are expected to be processed (as defined in this grammar). Of course real parsers can take a different approach if they have the same results. This is roughly similar to the order that rustc takes, though [`block_comment`](https://github.com/rust-lang/rust/blob/d7daac06d87e1252d10eaa44960164faac46beff/compiler/rustc_lexer/src/lib.rs#L782-L817) roughly takes the approach of combining the `/*` prefix, and then deciding if it is an inner doc comment, outer doc comment, or else a regular block comment. LINE_COMMENT must be first so that it is not confused with a doc comment. BLOCK_COMMENT must be last so that its cut operator does not interfere with doc comments that start with `/*`. It could be moved up higher in the list if it had negative lookahead to disambiguate OUTER_BLOCK_DOC, but the expression for that is more complicated than the one in OUTER_BLOCK_DOC.
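The prefix-based decision described above can be sketched in Rust as follows. This is a simplified illustration, not the reference grammar or rustc's code: `CommentKind` and `classify` are hypothetical names, and the function looks only at the opening characters, ignoring termination, nesting, and the carriage-return check.

```rust
#[derive(Debug, PartialEq)]
enum CommentKind {
    Line,
    InnerLineDoc,
    OuterLineDoc,
    Block,
    InnerBlockDoc,
    OuterBlockDoc,
}

/// Classifies a comment by its opening characters only.
fn classify(input: &str) -> Option<CommentKind> {
    if let Some(rest) = input.strip_prefix("//") {
        // Checked first: `////` and longer runs are plain comments.
        if rest.starts_with("//") {
            return Some(CommentKind::Line);
        }
        if rest.starts_with('!') {
            return Some(CommentKind::InnerLineDoc);
        }
        if rest.starts_with('/') {
            return Some(CommentKind::OuterLineDoc);
        }
        return Some(CommentKind::Line);
    }

    if let Some(rest) = input.strip_prefix("/*") {
        if rest.starts_with('!') {
            return Some(CommentKind::InnerBlockDoc);
        }
        // `/**` opens an outer doc comment unless it continues with
        // `*` or `/` (as in `/***` or the empty comment `/**/`).
        if let Some(after) = rest.strip_prefix('*') {
            if !after.starts_with('*') && !after.starts_with('/') {
                return Some(CommentKind::OuterBlockDoc);
            }
        }
        // Checked last: everything else is a regular block comment.
        return Some(CommentKind::Block);
    }

    None
}
```

The two lookahead-style checks (`////` for line comments, and `*` or `/` after `/**`) are what keep plain comments from being mistaken for doc comments.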

rustc actually includes the spaces for doc comments.

The cut operator after (`e`|`E`) in `FLOAT_EXPONENT` reflects rustc's actual parsing behavior: once the lexer sees an exponent indicator, it commits and does not backtrack. This makes the last `RESERVED_NUMBER` alternative -- which existed to catch the empty-exponent case -- redundant, since the cut in `FLOAT_EXPONENT` now handles it directly.

Co-authored-by: Eric Huss <eric@huss.org>
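A small Rust sketch may help show what committing after the exponent indicator means. It assumes the exponent body is an optional sign followed by a run of decimal digits and underscores containing at least one digit; the `Outcome` type and the `float_exponent` function are hypothetical names for this illustration and are not rustc's implementation.

```rust
#[derive(Debug, PartialEq)]
enum Outcome {
    /// The rule matched and consumed this many bytes.
    Matched(usize),
    /// The rule failed before reaching a cut; other alternatives may be tried.
    NoMatch,
    /// The rule failed after a cut; no other alternative is tried.
    HardError,
}

/// Sketch of an exponent rule with a cut right after `e` or `E`.
fn float_exponent(input: &str) -> Outcome {
    let mut chars = input.chars().peekable();

    match chars.next() {
        Some('e') | Some('E') => {}
        // No exponent indicator: an ordinary, recoverable mismatch.
        _ => return Outcome::NoMatch,
    }
    // Cut: from here on, a failure is final.
    let mut consumed = 1;

    // Optional sign.
    if matches!(chars.peek(), Some('+') | Some('-')) {
        chars.next();
        consumed += 1;
    }

    // Digits and underscores; at least one digit is required.
    let mut saw_digit = false;
    for c in chars {
        if c.is_ascii_digit() {
            saw_digit = true;
        } else if c != '_' {
            break;
        }
        consumed += 1;
    }

    if saw_digit {
        Outcome::Matched(consumed)
    } else {
        // The empty-exponent case is reported here, so no separate
        // catch-all alternative is needed for it.
        Outcome::HardError
    }
}
```

In this sketch `float_exponent("e10")` matches three bytes, while `float_exponent("e")` and `float_exponent("e+")` return `HardError` rather than `NoMatch`, so a caller implementing ordered choice would stop instead of trying the next alternative.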

The description says characters can be "surrounded in backticks", but it'd be better to say "surrounded by".

The grammar now accepts 4-6 hex digits for Unicode code points (needed for values above U+FFFF), so let's update the notation column to reflect the variable width. Let's also capitalize "Unicode", which is a proper noun.

These tests cover:

- Parser: negative lookahead with nonterminals, terminals, charsets, grouped expressions, within sequences, repetitions, and alternations; error case for trailing `!`; Unicode code points with 4, 5, and 6 hex digits; charset ranges with `Character::Char`, `Character::Unicode`, and mixed forms; charsets combining named entries, terminals, and Unicode ranges.
- Markdown renderer: negative lookahead rendering with `!`, Unicode rendering as `U+xxxx`, charset rendering with char and Unicode ranges, cut and neg expression rendering, and markdown escaping.
- Railroad renderer: negative lookahead renders as a "not followed by" labeled box, Unicode renders as terminal, charset ranges, cut renders as "no backtracking" labeled box, and neg expression renders as "with the exception of" labeled box.

rustbot · 2026-02-18T02:14:15Z

This PR was rebased onto a different master commit. Here's a range-diff highlighting what actually changed.

Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers.

rustbot added the S-waiting-on-review label (Status: The marked PR is awaiting review from a maintainer) on Feb 13, 2026

ehuss force-pushed the negative-lookahead branch from 2bd22f8 to 3084dec on February 15, 2026 03:44

traviscross reviewed Feb 17, 2026


ehuss commented Feb 18, 2026


ehuss and others added 12 commits February 18, 2026 02:05

Use negative lookahead in the grammar

999f883

This replaces some suffixes and prose with the new negative lookahead syntax instead. This should all have the same meaning.

Fix LINE_COMMENT grammar

cc7025c

This clarifies that bare `//` is explicitly meant to be either followed by LF or EOF. Otherwise it incorrectly matches other comment rules.

Fix BLOCK_COMMENT order

844b827

This fixes the BLOCK_COMMENT grammar so that it follows the rule that the first alternation that matches wins. The previous grammar would fail with the use of the cut operator to parse these two forms.

Fix desugaring of doc comments

7c12d35

rustc actually includes the spaces for doc comments.

Fix preposition in CharacterRange description

ae86f19

The description says characters can be "surrounded in backticks", but it'd be better to say "surrounded by".

Fix U+xxxx notation description

fc15897

The grammar now accepts 4-6 hex digits for Unicode code points (needed for values above U+FFFF), so let's update the notation column to reflect the variable width. Let's also capitalize "Unicode", which is a proper noun.

traviscross force-pushed the negative-lookahead branch from 372596f to 164f5fa on February 18, 2026 02:14

traviscross approved these changes Feb 18, 2026


traviscross added this pull request to the merge queue Feb 18, 2026

Merged via the queue into rust-lang:master with commit c6be577 on Feb 18, 2026. 6 checks passed.

rustbot removed the S-waiting-on-review label (Status: The marked PR is awaiting review from a maintainer) on Feb 18, 2026

traviscross merged 12 commits into rust-lang:master from ehuss:negative-lookahead
